HPCC-32795 Additional optimizations to replaceString #19239

jackdelv · 2024-10-25T12:01:34Z

Does nothing if no search string is supplied or if it is larger than the source string
Compare directly if search and source string are equal in length

- Does nothing if no search string is supplied or if it is larger than the source string
- Compare directly if search and source string are equal in length

Signed-off-by: Jack Del Vecchio <[email protected]>

github-actions · 2024-10-25T12:15:46Z

Jira Issue: https://hpccsystems.atlassian.net//browse/HPCC-32795

Jirabot Action Result:
Workflow Transition To: Merge Pending
Updated PR

dcamper

I think this needs a slight change.

dcamper · 2024-10-25T12:17:42Z

system/jlib/jstring.cpp

+        if (oldlen == curLen)
+        {
+            if (memcmp(buffer, oldStr, oldlen) == 0)
+                temp.append(newlen, newStr);


I would think it would be more efficient to copy newStr directly into buffer, rather than going through temp.

If you make that change, then temp and the call to swap() can be moved to the same block.

I agree it would be more efficient to copy directly and only use the temp buffer when necessary.

- Add unit tests for special case

dcamper

Looks good.

ghalliday

This PR only partially implements the suggestions in the jira ticket.

1. No replacements take place
This covers the case where the text is smaller than the search text, but does not cover the more general case where there were no matches.
Hint: What parameter could be passed, and how could a common function efficiently interpret it to avoid a memcpy and memory allocation if no search string was found?

2. The source and target replacement strings are the same length
You have implemented the case where the source and current strings are the same length.
Hint: How can you optimize and do an in-place replace if the source and target strings are the same length?

There is actually a 3rd case:

3. The replacement text is shorter than the search text
Question: What optimization does this enable for an in-place replacement? Can you efficiently common this up with case (2) to avoid duplicating too much code?
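One possible reading of cases (2) and (3), as a minimal sketch rather than the code in this PR: when the replacement is no longer than the search text, a write position can trail the read position through the same buffer, so no temporary buffer or allocation is needed. The function name `replaceInPlace` and its signature are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper, not part of jlib: replaces every occurrence of oldStr
// with newStr inside buf and returns the new length of the contents.
// Preconditions: oldLen > 0, newLen <= oldLen, and newStr does not alias buf.
std::size_t replaceInPlace(char * buf, std::size_t len,
                           const char * oldStr, std::size_t oldLen,
                           const char * newStr, std::size_t newLen)
{
    assert(oldLen > 0 && newLen <= oldLen);
    std::size_t read = 0;   // next byte of the original contents to examine
    std::size_t write = 0;  // next byte of the result; never ahead of read
    while (read < len)
    {
        if (len - read >= oldLen && std::memcmp(buf + read, oldStr, oldLen) == 0)
        {
            // The match has already been compared, so overwriting it is safe
            if (newLen)
                std::memcpy(buf + write, newStr, newLen);
            write += newLen;
            read += oldLen;
        }
        else
            buf[write++] = buf[read++];
    }
    return write;  // equals len when newLen == oldLen, smaller otherwise
}
```

With equal lengths the result length is unchanged; with a shorter replacement the caller would have to store the returned length, which is the same loop handling both cases.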

ghalliday · 2024-10-28T14:54:45Z

system/jlib/jstring.cpp

+        {
+            if (memcmp(buffer, oldStr, oldlen) == 0)
+            {
+                memcpy(buffer, newStr, newlen);


If I were reviewing this code (I am not, because of the general review comments), I would point out that this has a serious memory corruption bug if newlen > oldlen.

Fixed. I added ensureCapacity(newlen).
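To illustrate the issue and the fix discussed in this thread, here is a minimal sketch using a stand-in buffer type. `SketchBuffer`, its members, and `replaceWhole` are assumptions made for the example and are not the jlib `StringBuffer`; only the idea of calling an `ensureCapacity`-style method before the `memcpy` comes from the discussion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical stand-in for a growable string buffer; not the jlib class.
struct SketchBuffer
{
    std::vector<char> data;     // allocated storage
    std::size_t curLen = 0;     // number of bytes in use

    explicit SketchBuffer(const std::string & text)
        : data(text.begin(), text.end()), curLen(text.size()) {}

    // Grow the storage so that at least `needed` bytes can be written.
    void ensureCapacity(std::size_t needed)
    {
        if (data.size() < needed)
            data.resize(needed);
    }

    // Special case: the search string is exactly as long as the contents,
    // so the only possible match covers the whole buffer.
    bool replaceWhole(const char * oldStr, std::size_t oldLen,
                      const char * newStr, std::size_t newLen)
    {
        if (oldLen == 0 || oldLen != curLen)
            return false;
        if (std::memcmp(data.data(), oldStr, oldLen) != 0)
            return false;
        ensureCapacity(newLen);  // without this, newLen > oldLen writes past the end
        if (newLen)
            std::memcpy(data.data(), newStr, newLen);  // copy directly, no temporary
        curLen = newLen;
        return true;
    }

    std::string str() const { return std::string(data.begin(), data.begin() + curLen); }
};
```

The capacity check is cheap when the replacement is not longer than the match, so it does not cost the common case much.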

- Add ensureCapacity(newlen)

- Add additional parameter to replaceString that returns whether a match was found
- Wait to allocate memory until a match is made rather than at the beginning of the search
- Avoid copying source into result if source wasn't changed

- Move the target into the source rather than copy and skip allocating.
- Add tests for each case of target length

jackdelv · 2024-10-29T11:28:48Z

system/jlib/jstring.cpp

        while (offset < maxOffset)
        {
            if (unlikely(source[offset] == firstChar)
                && unlikely((lenOldStr == 1) || memcmp(source + offset, oldStr, lenOldStr)==0))
            {
+                // Wait to allocate memory until a match is found
+                if (unlikely(!foundMatch))


@ghalliday I am not sure I got this correct. I chose unlikely because it will only ever be true on the first match, so any string with multiple matches will get some benefit. If this is rare and in most cases there is only a single match then it should probably be likely. Is that correct?

Most of the time I wouldn't worry about adding likely/unlikely. The case where it may be important is in inner loops that are the critical points. Once we have a match it is much less critical.
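For readers unfamiliar with these hints, the following is a rough sketch of how such macros are commonly defined on GCC and Clang and where a hint tends to matter. The `SKETCH_` macro names and `countChar` are made up for this example; the project's own `likely`/`unlikely` definitions may differ.

```cpp
#include <cassert>
#include <cstddef>

// Assumed definitions for illustration only.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_LIKELY(x)   __builtin_expect(!!(x), 1)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define SKETCH_LIKELY(x)   (x)
#define SKETCH_UNLIKELY(x) (x)
#endif

// Counts occurrences of target. The hint sits on the test that runs for
// every byte, which is the kind of inner-loop check where it can matter.
std::size_t countChar(const char * text, std::size_t len, char target)
{
    assert(text != nullptr || len == 0);
    std::size_t count = 0;
    for (std::size_t i = 0; i < len; i++)
    {
        if (SKETCH_UNLIKELY(text[i] == target))
            count++;
    }
    return count;
}
```

The hint only influences code layout and branch prediction assumptions; the result is the same with or without it.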

jackdelv · 2024-10-29T11:32:35Z

@ghalliday I believe I got all the optimizations you were asking for.

1. I changed replaceString to wait to call ensureCapacity until a match has been found. This way, if no matches are found, the top-level replaceString never allocates any extra memory and returns itself instead of swapping with temp.
2. If the source string is the same size as the target, the memory can be moved rather than copied because we know there is space. This was the one I was least sure about. Is there a better way to move the memory?
3. If the target is smaller, then as the first step we can do the same thing as optimization 2, but we have to adjust curLen.

Let me know if I got any of these wrong. Back to you.

ghalliday

I think this implements the logic to not copy if it hasn't been modified, although I have a couple of suggestions to clean up.
It doesn't tackle the second suggestion of replacement text <= match text.

ghalliday · 2024-10-30T12:49:50Z

rtl/eclrtl/eclrtl.cpp

-    ::replaceString(result, rtlUtf8Size(scriptChars, script), script, rtlUtf8Size(searchChars, search), search, rtlUtf8Size(outFieldsChars, outFields), outFields);
+    bool foundMatch = false;
+    size_t sourceLen = rtlUtf8Size(scriptChars, script);
+    ::replaceString(result, sourceLen, script, rtlUtf8Size(searchChars, search), search, rtlUtf8Size(outFieldsChars, outFields), outFields, foundMatch);


If this is the only place replaceString is called, then I would consider a slightly different approach:
i) Pass a flag to indicate whether to copy if there is no match.
ii) Change the return type to a boolean and return whether or not a replace occurred.

Otherwise it has slightly strange semantics for a public function.

Changed it to return a bool indicating whether a match was found, and to take a parameter that sets whether to copy even if no match is found.
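The shape described here could look roughly like the sketch below. It uses `std::string` instead of the jlib buffer types, and the name `replaceStringSketch`, the parameter order, and the `copyIfNoMatch` flag name are assumptions for illustration rather than the signature merged in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical sketch: returns true if at least one replacement was made.
// result is only written when a match is found or when copyIfNoMatch is set.
// Precondition: result must not be the same object as source.
bool replaceStringSketch(std::string & result, const std::string & source,
                         const std::string & oldStr, const std::string & newStr,
                         bool copyIfNoMatch)
{
    assert(&result != &source);
    bool foundMatch = false;
    const std::size_t oldLen = oldStr.size();
    // Nothing to do if there is no search string or it is longer than the source
    if (oldLen != 0 && oldLen <= source.size())
    {
        const std::size_t maxOffset = source.size() - oldLen + 1;
        std::size_t offset = 0;
        std::size_t lastCopied = 0;  // start of the text not yet appended to result
        while (offset < maxOffset)
        {
            if (source[offset] == oldStr[0]
                && std::memcmp(source.data() + offset, oldStr.data(), oldLen) == 0)
            {
                if (!foundMatch)
                {
                    // Wait to allocate memory until a match is found
                    result.clear();
                    result.reserve(source.size());
                    foundMatch = true;
                }
                result.append(source, lastCopied, offset - lastCopied);
                result.append(newStr);
                offset += oldLen;
                lastCopied = offset;
            }
            else
                offset++;
        }
        if (foundMatch)
            result.append(source, lastCopied, std::string::npos);
    }
    if (!foundMatch && copyIfNoMatch)
        result = source;  // the caller asked for a copy even without a match
    return foundMatch;
}
```

A caller that already holds the source can pass `false` for the flag and use the return value to decide whether to look at `result` at all, which avoids both the allocation and the copy in the no-match case.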

ghalliday · 2024-10-30T12:52:16Z

system/jlib/jstring.cpp

        while (offset < maxOffset)
        {
            if (unlikely(source[offset] == firstChar)
                && unlikely((lenOldStr == 1) || memcmp(source + offset, oldStr, lenOldStr)==0))
            {
+                // Wait to allocate memory until a match is found
+                if (unlikely(!foundMatch))


Most of the time I wouldn't worry about adding likely/unlikely. The case where it may be important is in inner loops that are the critical points. Once we have a match it is much less critical.

- Returns a bool indicating whether a match was made
- Accepts a parameter to force a copy even if no match was found

HPCC-32795 Additional optimizations to replaceString

6f71021

jackdelv requested review from ghalliday and dcamper October 25, 2024 12:01

dcamper requested changes Oct 25, 2024

Directly copy newStr to buffer in special case.

4bb7a82

jackdelv requested a review from dcamper October 25, 2024 14:54

dcamper approved these changes Oct 25, 2024

ghalliday requested changes Oct 28, 2024

ghalliday reviewed Oct 28, 2024

jackdelv added 3 commits October 28, 2024 11:12

Fix memory corruption issue

98315c6

Optimize for case where no matches are found

19984dd

Optimize for case where target is less than or equal to source string.

b6dc378

jackdelv commented Oct 29, 2024

jackdelv requested a review from ghalliday October 29, 2024 11:33

ghalliday reviewed Oct 30, 2024

jackdelv added 3 commits October 30, 2024 11:04

Change return type of replaceString function

e3364e7

Remove extra line

b57c932

Remove duplicate code.

45800b0

jackdelv requested a review from ghalliday October 31, 2024 18:17
